An Arabic-Hebrew parallel corpus of TED talks
We describe an Arabic-Hebrew parallel corpus of TED talks built upon WIT3, the Web inventory that repurposes the original content of the TED website in a way that is more convenient for MT researchers.
The benchmark consists of about 2,000 talks, whose subtitles in Arabic and Hebrew have been accurately aligned and rearranged into sentences, for a total of about 3.5M tokens per language. Talks have been partitioned into train, development and test sets, following in all respects the MT tasks of the IWSLT 2016 evaluation campaign.
In addition to describing the benchmark, we list the problems encountered in preparing it and the novel methods designed to solve them. Baseline MT results and some measures on sentence length are provided as an extrinsic evaluation of the quality of the benchmark.
The ITC-irst statistical machine translation system for IWSLT-2004
The focus of this paper is the system for statistical machine translation developed at ITC-irst. It was employed in the evaluation campaign of the International Workshop on Spoken Language Translation 2004 in all three data set conditions of the Chinese-English track. Both the statistical model underlying the system and the system architecture are presented. Moreover, details are given on how the submitted runs were produced.
CTC-based Compression for Direct Speech Translation
Previous studies demonstrated that a dynamic phone-informed compression of the input audio is beneficial for speech translation (ST). However, they required a dedicated model for phone recognition and did not test this solution for direct ST, in which a single model translates the input audio into the target language without intermediate representations. In this work, we propose the first method able to perform a dynamic compression of the input in direct ST models. In particular, we exploit Connectionist Temporal Classification (CTC) to compress the input sequence according to its phonetic characteristics. Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement over a strong baseline on two language pairs (English-Italian and English-German), while also reducing the memory footprint by more than 10%.
Comment: Accepted at EACL202
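The core idea of CTC-based compression can be sketched in a few lines: consecutive frames that receive the same CTC prediction are collapsed into a single vector by averaging. This is a minimal illustration under assumed shapes (function name, feature dimensions, and label values are invented for the example), not the paper's actual implementation:

```python
import numpy as np

def ctc_compress(features, ctc_preds):
    """Collapse runs of consecutive frames sharing the same CTC
    prediction by averaging their feature vectors."""
    assert len(features) == len(ctc_preds)
    compressed, start = [], 0
    for i in range(1, len(ctc_preds) + 1):
        # close a segment when the prediction changes (or at the end)
        if i == len(ctc_preds) or ctc_preds[i] != ctc_preds[start]:
            compressed.append(np.mean(features[start:i], axis=0))
            start = i
    return np.stack(compressed)

# 6 frames of 2-dim features; predictions group them into 3 segments
feats = np.arange(12, dtype=float).reshape(6, 2)
preds = [0, 0, 5, 5, 5, 0]   # 0 = CTC blank, 5 = some phone-like unit
out = ctc_compress(feats, preds)
print(out.shape)  # (3, 2): the sequence is shortened, saving memory
```

Shortening the sequence before the encoder's attention layers is what yields the memory-footprint reduction mentioned above.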
Evaluating Subtitle Segmentation for End-to-end Generation Systems
Subtitles appear on screen as short pieces of text, segmented based on formal constraints (length) and syntactic/semantic criteria. Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference. However, standard segmentation metrics cannot be applied when systems generate outputs different from the reference, e.g. with end-to-end subtitling systems. In this paper, we study ways to conduct reference-based evaluations of segmentation accuracy irrespective of the textual content. We first conduct a systematic analysis of existing metrics for evaluating subtitle segmentation. We then introduce Sigma, a new Subtitle Segmentation Score derived from an approximate upper bound of BLEU on segmentation boundaries, which allows us to disentangle the effect of good segmentation from text quality. To compare Sigma with existing metrics, we further propose a boundary projection method from imperfect hypotheses to the true reference. Results show that all metrics are able to reward high-quality output, but for similar outputs the system ranking depends on each metric's sensitivity to error type. Our thorough analyses suggest that Sigma is a promising segmentation metric, but its reliability with respect to other segmentation metrics remains to be validated through correlations with human judgements.
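Since Sigma's exact formula is not reproduced here, the following is only a generic sketch of reference-based boundary evaluation: an F1 over break positions, assuming hypothesis and reference share the same word sequence (the `<eob>` break marker and function names are illustrative assumptions):

```python
def boundary_positions(tokens, break_token="<eob>"):
    """Return word-index positions of subtitle breaks,
    ignoring the break markers themselves when counting words."""
    positions, idx = set(), 0
    for tok in tokens:
        if tok == break_token:
            positions.add(idx)
        else:
            idx += 1
    return positions

def boundary_f1(hyp, ref, break_token="<eob>"):
    """Precision/recall F1 over break positions; only meaningful
    when hyp and ref contain the same words."""
    h = boundary_positions(hyp.split(), break_token)
    r = boundary_positions(ref.split(), break_token)
    if not h or not r:
        return 0.0
    tp = len(h & r)
    if tp == 0:
        return 0.0
    p, rec = tp / len(h), tp / len(r)
    return 2 * p * rec / (p + rec)

ref = "we describe a corpus <eob> of TED talks"
print(boundary_f1("we describe a corpus <eob> of TED talks", ref))  # 1.0
print(boundary_f1("we describe <eob> a corpus of TED talks", ref))  # 0.0
```

The boundary projection method proposed in the paper addresses precisely the case this sketch cannot handle: hypotheses whose words differ from the reference.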
No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation
Automatic speech recognition (ASR) systems are known to be sensitive to the sociolinguistic variability of speech data, in which gender plays a crucial role. This can result in disparities in recognition accuracy between male and female speakers, primarily due to the under-representation of the latter group in the training data. While several solutions have been proposed in the context of hybrid ASR models, the gender bias issue has not been explicitly addressed in end-to-end neural architectures. To fill this gap, we propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants. This technique reduces the data imbalance among genders by simulating voices of the under-represented female speakers and increases the variability within each gender group. Experiments on spontaneous English speech show that our technique yields a relative WER improvement of up to 9.87% for utterances by female speakers, with larger gains for the least-represented f0 ranges.
Comment: Accepted at ASRU 202
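The f0-scaling idea can be illustrated on a pure tone (a toy sketch only: real augmentation manipulates f0 and formants of recorded speech with dedicated signal-processing tools, and the frequencies and scale factor below are invented for the example):

```python
import numpy as np

def scale_f0_sine(f0_hz, factor, sr=16000, dur=0.1):
    """Synthesize a tone at f0 and a 'shifted' tone at factor * f0,
    mimicking the mapping of a low (male-range) pitch into a
    higher (female-range) one."""
    t = np.arange(int(sr * dur)) / sr
    return (np.sin(2 * np.pi * f0_hz * t),
            np.sin(2 * np.pi * f0_hz * factor * t))

def dominant_freq(signal, sr=16000):
    """Estimate the dominant frequency from the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    return freqs[np.argmax(spectrum)]

# ~120 Hz is a typical male mean f0; x1.8 lands in a female-typical range
low, high = scale_f0_sine(120.0, 1.8)
print(dominant_freq(low), dominant_freq(high))
```

On real speech the same scaling must preserve duration and shift formants consistently, which is why vocoder-style processing is needed rather than plain resampling.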
Towards a methodology for evaluating automatic subtitling
In response to the increasing interest in automatic subtitling, this EAMT-funded project aimed to collect subtitle post-editing data in a real use-case scenario in which professional subtitlers edit automatically generated subtitles. The post-editing setting includes, for the first time, automatic generation of timestamps and segmentation, and focuses on the effect of timing and segmentation edits on the post-editing process. The collected data will serve as the basis for investigating how subtitlers interact with automatic subtitling and for devising evaluation methods geared to the multimodal nature and formal requirements of subtitling.
The repetition rate of text as a predictor of the effectiveness of machine translation adaptation
Since the effectiveness of MT adaptation relies on the repetitiveness of the text, the question of how to measure repetitions in a text naturally arises. This work deals with the issue of identifying and evaluating text features that might help predict the impact of MT adaptation on translation quality. In particular, the repetition rate metric we recently proposed is compared to other features employed in closely related NLP tasks. The comparison is carried out through a regression analysis between feature values and the MT performance gains of dynamically adapted versus non-adapted MT engines on five different translation tasks. The main outcome of the experiments is that the repetition rate correlates better than any other considered feature with the MT gains yielded by online adaptation, although using all features jointly results in better predictions than any single feature alone.
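A repetition-rate-style measure can be sketched as the geometric mean, over n-gram orders 1 to 4, of the fraction of n-gram occurrences whose type appears more than once. This is a simplified sketch with invented details: the published metric averages these statistics over fixed-size sliding windows, whereas this version uses the whole text:

```python
from collections import Counter
from math import prod

def repetition_rate(words, max_n=4):
    """Geometric mean over n = 1..max_n of the fraction of n-gram
    occurrences belonging to a non-singleton (repeated) type."""
    rates = []
    for n in range(1, max_n + 1):
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        counts = Counter(grams)
        # occurrences of n-grams whose type appears more than once
        repeated = sum(c for c in counts.values() if c > 1)
        rates.append(repeated / len(grams))
    return prod(rates) ** (1.0 / max_n)

text = "the cat sat on the mat the cat sat on the rug".split()
print(round(repetition_rate(text), 3))   # high: long phrases recur
print(repetition_rate("a b c d e".split()))  # 0.0: no repetition at all
```

A text with a higher score of this kind gives an adaptive MT engine more chances to reuse what it learned from earlier segments, which is the intuition behind using it as a predictor of adaptation gains.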
Methods for Smoothing the Optimizer Instability in SMT
In SMT, the instability of MERT, the commonly used optimizer, is an acknowledged problem. This paper presents two methods for smoothing the MERT instability. Both exploit a set of different realizations of the same system obtained by running the optimization stage multiple times. One method averages the sets of different optimal weights; the other combines the translations generated by the various realizations. Experiments conducted on two different-sized tasks involving four different language pairs show that both methods are effective in smoothing instability, and also that the averaged system competes well with the more expensive system combination.
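The weight-averaging method is the cheaper of the two: decode once with the mean of the weight vectors found by the separate MERT runs. A minimal sketch (the weight values, feature layout, and hypothesis features below are invented for illustration):

```python
import numpy as np

# Feature weights from three independent MERT runs of the same system
# (e.g. language model, translation model, distortion, length penalty).
runs = np.array([
    [0.28, 0.12, 0.45, 0.15],
    [0.31, 0.09, 0.43, 0.17],
    [0.26, 0.14, 0.47, 0.13],
])

# The "average system" simply decodes with the mean weight vector,
# instead of combining the outputs of all realizations.
avg_weights = runs.mean(axis=0)

def model_score(weights, features):
    """Log-linear model score used to rank translation hypotheses."""
    return float(np.dot(weights, features))

hyp_features = np.array([1.2, -0.4, 2.0, 0.8])
print(avg_weights, model_score(avg_weights, hyp_features))
```

System combination, by contrast, requires decoding with every realization and merging the resulting translations, which is why averaging is the less expensive option.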
Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection
When translating words referring to the speaker, speech translation (ST) systems should not resort to default masculine generics nor rely on potentially misleading vocal traits. Rather, they should assign gender according to the speakers' preference. The existing solutions to do so, though effective, are hardly feasible in practice as they involve dedicated model re-training on gender-labeled ST data. To overcome these limitations, we propose the first inference-time solution to control speaker-related gender inflections in ST. Our approach partially replaces the (biased) internal language model (LM) implicitly learned by the ST decoder with gender-specific external LMs. Experiments on en->es/fr/it show that our solution outperforms the base models and the best training-time mitigation strategy by up to 31.0 and 1.6 points in gender accuracy, respectively, for feminine forms. The gains are even larger (up to 32.0 and 3.4) in the challenging condition where speakers' vocal traits conflict with their gender.
Comment: Accepted at EMNLP 202
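The general mechanism of swapping LMs at inference time can be sketched as internal-LM subtraction combined with shallow fusion at each decoding step. This is a generic sketch, not the paper's exact formulation: the fusion weights, vocabulary, and probabilities below are invented for illustration:

```python
import numpy as np

def fused_logprobs(st_logp, ilm_logp, ext_logp, lam_ilm=0.3, lam_ext=0.3):
    """Per-step token scores: subtract an estimate of the ST decoder's
    internal LM and add a gender-specific external LM, then renormalize
    to a proper log-distribution."""
    fused = st_logp - lam_ilm * ilm_logp + lam_ext * ext_logp
    return fused - np.log(np.sum(np.exp(fused)))

# toy vocabulary: ["contento" (masc.), "contenta" (fem.), "<eos>"]
st  = np.log([0.60, 0.25, 0.15])   # ST decoder prefers the masculine form
ilm = np.log([0.70, 0.15, 0.15])   # its internal LM carries the same bias
ext = np.log([0.10, 0.80, 0.10])   # feminine-specific external LM
out = fused_logprobs(st, ilm, ext)
print(np.argmax(out))  # 1: the feminine form now wins
```

Because only the scoring at decoding time changes, no re-training on gender-labeled ST data is needed; choosing the masculine or feminine external LM steers the inflection.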
Extending the MuST-C Corpus for a Comparative Evaluation of Speech Translation Technology
This project aimed to extend the test sets of the MuST-C speech translation (ST) corpus with new reference translations. The new references were collected from professional post-editors working on the output of different ST systems for three language pairs: English-German/Italian/Spanish. In this paper, we briefly describe how the data were collected and how they are distributed. As evidence of their usefulness, we also summarise the findings of the first comparative evaluation of cascade and direct ST approaches, which was carried out relying on the collected data. The project was partially funded by the European Association for Machine Translation (EAMT) through its 2020 Sponsorship of Activities programme.